Skip to content

Add StatisticsContext parameter to partition_statistics#21815

Open
asolimando wants to merge 2 commits intoapache:mainfrom
asolimando:asolimando/partition-statistics-context
Open

Add StatisticsContext parameter to partition_statistics#21815
asolimando wants to merge 2 commits intoapache:mainfrom
asolimando:asolimando/partition-statistics-context

Conversation

@asolimando
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #20184

Rationale for this change

ExecutionPlan::partition_statistics forces each operator to re-fetch child statistics internally, causing exponential recomputation in deep plans and making it impossible to inject enriched statistics from external sources (e.g., expression-level analyzers, custom statistics providers).

What changes are included in this PR?

Breaking change: the ExecutionPlan::partition_statistics signature changes from (&self, partition: Option<usize>) to (&self, partition: Option<usize>, ctx: &StatisticsContext). Migration guide added to docs/source/library-user-guide/upgrading/54.0.0.md.

Add a StatisticsContext parameter to partition_statistics that carries pre-computed child statistics, and a compute_statistics() utility that walks the plan tree bottom-up, threading child statistics through the context automatically. StatisticsContext carries one Arc<Statistics> per child node and is designed to be extended with additional context (e.g., expression-level analyzers, custom statistics providers) without further signature changes.

Operator categories

  • Leaf nodes (EmptyExec, PlaceholderRowExec, WorkTableExec, DataSourceExec): ignore the context, return their own stats. DataSourceExec delegates to the DataSource trait which has a separate partition_statistics that was not changed.
  • Passthrough operators (BufferExec, CooperativeExec, PartialSortExec, OutputRequirementExec): return ctx.child_stats()[0] directly.
  • Transform operators (FilterExec, ProjectionExec, AggregateExec, WindowAggExec, CoalesceBatchesExec, GlobalLimitExec, LocalLimitExec): use ctx.child_stats()[0] as input, then apply their transformation (selectivity, column projection, grouping cardinality, fetch limit, etc.).
  • Partition-merging operators (CoalescePartitionsExec, SortPreservingMergeExec, SortExec with !preserve_partitioning, RepartitionExec): always need overall child stats regardless of which output partition is requested, since they merge/redistribute input partitions. These call compute_statistics(child, None) internally instead of using the context.
  • Symmetric joins (SortMergeJoinExec): both sides use the same partition, so ctx.child_stats() is correct for both None and Some(i) cases.
  • Asymmetric joins (HashJoinExec CollectLeft, CrossJoinExec, NestedLoopJoinExec): the left (broadcast) side always needs overall stats, so they call compute_statistics(left, None) for the Some case. The right side is partitioned and uses ctx.child_stats()[1] directly. HashJoinExec Partitioned mode is symmetric (both use context). HashJoinExec Auto mode needs overall stats from both sides.
  • Union/Interleave: UnionExec uses ctx.child_stats() for the None case (reduces with stats_union). For Some(partition), Union remaps partition indices across children and calls compute_statistics on the specific child with the remapped index. InterleaveExec uses ctx.child_stats() directly (symmetric across all inputs).

Callers

All direct plan.partition_statistics(None) calls in optimizer rules (JoinSelection, AggregateStatistics, EnforceDistribution), display code, StatisticsRegistry, and tests are replaced with compute_statistics(plan, None).

Tests

No new tests added. This is a no-op refactoring confirmed by all existing tests passing unchanged across all affected crates (datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource).

What remains for follow-up

  • Adding expression-level analyzers and custom statistics providers to StatisticsContext (eliminates the separate StatisticsRegistry tree walk and the ExpressionAnalyzer injection machinery from Add ExpressionAnalyzer for pluggable expression-level statistics estimation #21122)
  • Extending DataSource::partition_statistics with context if needed
  • Caching to avoid redundant compute_statistics(child, None) calls: partition-merging operators (CoalescePartitions, SortPreservingMerge, etc.) and asymmetric joins (HashJoin CollectLeft, CrossJoin, NestedLoopJoin) currently call compute_statistics(child, None) internally when the requested partition is Some, triggering a separate bottom-up walk. A cache on StatisticsContext keyed by (plan node, partition) would let these reuse already-computed results.

Test plan

  • cargo fmt --all
  • cargo clippy --all-targets --all-features -- -D warnings (affected crates)
  • cargo test --profile ci on datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource

Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

Introduce StatisticsContext that carries pre-computed child statistics
and external context for statistics computation. Change the
ExecutionPlan::partition_statistics signature to accept it, and add
compute_statistics() utility for bottom-up computation with automatic
child stats threading.

Update all ~35 in-tree ExecutionPlan implementations and ~40 call
sites. Passthrough operators return ctx.child_stats() directly,
transform operators use it instead of re-fetching from children,
and operators that always need overall child stats (RepartitionExec,
CoalescePartitionsExec, SortPreservingMergeExec, SortExec non-preserving,
HashJoinExec CollectLeft/Auto, CrossJoinExec, NestedLoopJoinExec)
call compute_statistics with None internally.
@github-actions github-actions Bot added documentation Improvements or additions to documentation optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Apr 23, 2026
@asolimando
Copy link
Copy Markdown
Member Author

Hi @xudong963, I have opened the PR as a prerequisite for #21122, as discussed.

This is a breaking change and I therefore added a section under .../library-user-guide/upgrading/54.0.0.md‎, I have checked around what usually goes there, but I'd appreciate if you could take a deeper look and confirm if I captured what's expected for the update guide.

Looking forward to your feedback!

@xudong963
Copy link
Copy Markdown
Member

@asolimando thanks, I'll review it next Monday! /cc @jonathanc-n

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation optimizer Optimizer rules physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Let partition_statistics accept pre-computed children statistics

2 participants